Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rates of SDoH in cancer populations. Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models for SDoH extraction, examined the generalizability of the NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts. Results and Conclusion: We developed a corpus of 629 cancer patients' notes annotated with 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores: 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models with new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH could be extracted from >70% of cancer patients, while 9 SDoH had low extraction rates (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
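The SODA repository ships its own checkpoints and runner scripts; as a hedged illustration of how a pre-trained transformer concept-extraction model of this kind is typically applied to a note, here is a minimal sketch using the Hugging Face token-classification pipeline (the checkpoint path is a hypothetical placeholder, not the package's API):

```python
# Illustrative only: the checkpoint path is a hypothetical placeholder.
from transformers import pipeline

sdoh_extractor = pipeline(
    "token-classification",
    model="path/to/soda-bert-sdoh",  # hypothetical fine-tuned BERT checkpoint
    aggregation_strategy="simple",    # merge word pieces into whole spans
)

note = "Patient is a retired teacher, lives alone, smokes 1 pack/day."
for entity in sdoh_extractor(note):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```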
Machine Translation (MT) systems generally aim at the automatic rendering of a source language into a target language, retaining the original context, using various Natural Language Processing (NLP) techniques. Among these methods, Statistical Machine Translation (SMT) uses probabilistic and statistical techniques to analyze and convert text. This paper describes the development of bilingual SMT models for translating English to fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are briefly described with respect to our experimental needs. Further, a detailed analysis of the Samanantar and OPUS datasets for model building, along with the standard benchmark dataset (Flores-200) for fine-tuning and testing, is carried out as part of our experiment. Different preprocessing approaches are proposed to handle the noise in the datasets. To build the system, the MOSES open-source SMT toolkit is used. Distance-based reordering is utilized, with the aim of capturing grammatical rules and context-dependent adjustments, through a phrase reordering categorization framework. In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
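Of the metrics above, BLEU is the most commonly scripted; a minimal sketch of corpus-level BLEU with the sacrebleu package (the sentences are toy examples, not our data):

```python
# Corpus-level BLEU with sacrebleu, assuming one hypothesis per line and a
# single reference stream; the example sentences are illustrative only.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he went to the market"]
references = [["the cat is on the mat", "he went to market"]]  # one ref stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```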
Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation and propose two designs to improve this metric in Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants on language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
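For intuition, here is a minimal sketch (our reading of blockwise causal attention as commonly described, not the authors' released code) of the corresponding attention mask, in which each position attends causally within its own block and the immediately preceding one, keeping relative distances bounded at inference time:

```python
import torch

def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    block = pos // block_size
    causal = pos[None, :] <= pos[:, None]            # no attending to the future
    nearby = (block[:, None] - block[None, :]) <= 1  # own block or the previous one
    return causal & nearby                           # True = attention allowed

mask = blockwise_causal_mask(seq_len=8, block_size=4)
print(mask.int())
```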
Deep neural networks (DNNs) are prone to miscalibrated predictions, often exhibiting a mismatch between the predicted output and the associated confidence scores. Contemporary model calibration techniques mitigate overconfident predictions by pushing down the confidence of the winning class while increasing the confidence of the remaining classes across all test samples. However, from a deployment perspective, an ideal model should (i) generate well-calibrated predictions for high-confidence samples (predicted probability, say, >0.95) and (ii) generate a higher proportion of legitimate high-confidence samples. To this end, we propose a novel regularization technique that can be used with classification losses, leading to state-of-the-art calibrated predictions at test time. From a deployment standpoint in safety-critical applications, only high-confidence samples from a well-calibrated model are of interest, as the remaining samples must undergo manual inspection. Reducing the predictive confidence of these potentially ``high-confidence samples'' is a downside of existing calibration approaches. We mitigate this by proposing a dynamic train-time data pruning strategy that prunes low-confidence samples every few epochs, providing an increase in "confident yet calibrated samples". We demonstrate state-of-the-art calibration performance across image classification benchmarks, reducing training time without much compromise in accuracy. We also provide insights into why our dynamic pruning strategy, which prunes low-confidence training samples, leads to an increase in high-confidence samples at test time.
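A minimal sketch of the dynamic pruning idea, assuming PyTorch, a partially trained `model`, and a map-style `dataset`; the keep ratio and schedule are illustrative placeholders, not the paper's values:

```python
import torch
from torch.utils.data import DataLoader, Subset

@torch.no_grad()
def prune_low_confidence(model, dataset, keep_ratio=0.9, batch_size=256):
    model.eval()
    confidences = []
    for x, _ in DataLoader(dataset, batch_size=batch_size):
        probs = torch.softmax(model(x), dim=1)       # per-sample class probabilities
        confidences.append(probs.max(dim=1).values)  # winning-class confidence
    confidences = torch.cat(confidences)
    keep = int(keep_ratio * len(dataset))
    kept_idx = torch.topk(confidences, keep).indices.tolist()
    return Subset(dataset, kept_idx)  # keep only the most confident samples

# e.g. inside the training loop, every few epochs:
# if epoch % prune_every == 0:
#     train_set = prune_low_confidence(model, train_set, keep_ratio=0.9)
```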
Machine Translation Systems (MTS) are efficient tools for communication, converting text or speech from one language to another. In a large multilingual environment like India, where English and a set of Indian Languages (ILs) are officially used, the need for efficient translation systems becomes evident. In contrast to English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetric nature, multilingual neural machine translation (MNMT) systems have evolved as an ideal approach in this direction. In this paper, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems: one for English-Indic (one-to-many) and another for Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have only a small amount of parallel corpora, insufficient to train any machine translation model, we explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art Transformer architecture is used to realize the proposed model. Experiments over a large amount of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relatedness (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantage of back-translation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model emerges as more effective than the baseline models in terms of evaluation metrics, i.e., BLEU (BiLingual Evaluation Understudy) scores, for a set of ILs.
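Back-translation, mentioned above, is straightforward to sketch: monolingual target-side text is translated back into the source language to create synthetic parallel pairs. The `.translate()` interface below is a hypothetical placeholder, not our pipeline:

```python
# Hypothetical model interface, for illustration only.
def back_translate(monolingual_targets, target_to_source_model):
    """Turn monolingual target-language sentences into synthetic (src, tgt) pairs."""
    synthetic_pairs = []
    for target_sentence in monolingual_targets:
        synthetic_source = target_to_source_model.translate(target_sentence)
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs

# The synthetic pairs are then mixed with the real parallel corpus before
# training the one-to-many / many-to-one models.
```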
Neural fields model signals by mapping coordinate inputs to sampled values. They are becoming an increasingly important backbone architecture across many fields, from vision and graphics to biology and astronomy. In this paper, we explore the differences between common conditioning mechanisms within these networks, an essential ingredient in shifting neural fields from memorization of signals to generalization, where a set of signals lying on a manifold is modeled jointly. In particular, we are interested in the scaling behavior of these mechanisms with increasingly high-dimensional conditioning variables. As we show in our experiments, high-dimensional conditioning is key to modeling complex data distributions; thus, it is important to determine which architecture choices best enable this when working on such problems. To this end, we run experiments modeling 2D, 3D, and 4D signals with conditioning strategies based on concatenation, hypernetworks, and attention, a necessary but laborious effort that has not yet been undertaken in the literature. We find that attention-based conditioning outperforms the other approaches in a variety of settings.
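For concreteness, the simplest of the three mechanisms, concatenation, can be sketched as follows (a generic PyTorch illustration, not our experimental code); hypernetwork and attention-based conditioning replace the concatenation step with generated MLP weights or cross-attention to the latent, respectively:

```python
import torch
import torch.nn as nn

class ConcatConditionedField(nn.Module):
    """A neural field conditioned by appending a latent code z to each coordinate."""
    def __init__(self, coord_dim=2, latent_dim=256, hidden=256, out_dim=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(coord_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords, z):
        # coords: (N, coord_dim); z: (latent_dim,) shared across all coordinates
        z = z.expand(coords.shape[0], -1)
        return self.mlp(torch.cat([coords, z], dim=-1))

field = ConcatConditionedField()
rgb = field(torch.rand(1024, 2), torch.randn(256))  # sample a conditioned 2D field
```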
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks. This is inspired by the observation that view-based surface representations are more effective at modeling high-resolution surface details and textures than their 3D counterparts based on point clouds or voxel occupancy. Specifically, given a 3D shape, we render it from multiple views and set up a dense correspondence learning task within a contrastive learning framework. As a result, the learned 2D representations are view-invariant and geometrically consistent, leading to better generalization when trained on a limited number of labeled shapes, compared to alternatives that utilize self-supervision in 2D or 3D alone. Experiments on textured (RenderPeople) and untextured (PartNet) 3D datasets demonstrate that our method outperforms state-of-the-art alternatives in fine-grained part segmentation. The improvements over baselines are greater when only a sparse set of views is available for training or when shapes are textured, indicating that MvDeCor benefits from both 2D processing and 3D geometric reasoning.
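The dense correspondence objective can be sketched as a standard InfoNCE loss over corresponding pixel features (a generic illustration, not the MvDeCor implementation): pixels that correspond across two rendered views are pulled together, and all other pixels in the batch are pushed away.

```python
import torch
import torch.nn.functional as F

def dense_info_nce(feats_view1, feats_view2, temperature=0.07):
    # feats_view*: (N, D) features of N pixel pairs known to correspond
    a = F.normalize(feats_view1, dim=1)
    b = F.normalize(feats_view2, dim=1)
    logits = a @ b.t() / temperature    # (N, N) pairwise similarities
    targets = torch.arange(a.size(0))   # the diagonal holds the positives
    return F.cross_entropy(logits, targets)

loss = dense_info_nce(torch.randn(512, 64), torch.randn(512, 64))
```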
Robust classification is essential in tasks like autonomous vehicle sign recognition, where the downsides of misclassification can be grave. Adversarial attacks threaten the robustness of neural network classifiers, causing them to consistently and confidently misidentify road signs. One such class of attack, shadow-based attacks, causes misidentifications by applying natural-looking shadows to input images, producing road signs that look natural to a human observer but confusing to these classifiers. Current defenses against such attacks use a simple adversarial training procedure, achieving a rather low 25% and 40% robustness on the GTSRB and LISA test sets, respectively. In this paper, we propose a robust, fast, and generalizable method, designed to defend against shadow attacks in the context of road sign recognition, that augments source images with binary adaptive threshold and edge maps. We empirically show its robustness against shadow attacks, and reformulate the problem to show its similarity to $\varepsilon$ perturbation-based attacks. Experimental results show that our edge defense achieves 78% robustness while retaining 98% benign test accuracy on the GTSRB test set, with similar results from our threshold defense. A link to our code is in the paper.
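The two augmentations are standard image operations; a minimal OpenCV sketch of deriving a binary adaptive threshold map and an edge map and stacking them alongside the source image (parameter values are illustrative, not the tuned ones from the paper):

```python
import cv2
import numpy as np

def augment_with_threshold_and_edges(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # binary adaptive threshold map (block size 11, constant 2: illustrative)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )
    # edge map (Canny thresholds 100/200: illustrative)
    edges = cv2.Canny(gray, 100, 200)
    # stack the derived channels with the original image: (H, W, 3+1+1)
    return np.concatenate(
        [bgr_image, binary[..., None], edges[..., None]], axis=-1
    )
```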
FOLD-R++ is an efficient rule-based machine learning algorithm for binary classification tasks. It generates a stratified normal logic program as an (explainable) trained model. We present an improvement over the FOLD-R++ algorithm, termed FOLD-SE, that provides scalable explainability (SE) while inheriting all the merits of FOLD-R++. Scalable explainability means that, regardless of the size of the dataset, the number of learned rules and learned literals stays small and hence understandable by humans, while maintaining good classification performance. FOLD-SE is competitive in performance with state-of-the-art algorithms such as XGBoost and Multi-Layer Perceptrons (MLP). However, unlike XGBoost and MLP, the FOLD-SE algorithm generates models with scalable explainability. FOLD-SE outperforms the FOLD-R++ and RIPPER algorithms in efficiency, performance, and explainability, especially for large datasets. The FOLD-RM algorithm is an extension of FOLD-R++ for multi-class classification tasks; an improved FOLD-RM algorithm is also presented.
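To convey what a small default-with-exception rule set of this kind looks like as a classifier, here is a hand-written Python rendering for a hypothetical loan-approval task; it is not FOLD-SE's output or implementation, only an illustration of the default rule / abnormality-exception structure such programs encode:

```python
# Hypothetical rule set, illustrative only.
def abnormal(applicant):
    # exception: high income does not help if the applicant is delinquent
    return applicant["delinquent"]

def approve(applicant):
    # default rule: approve if income > 50000, unless an exception fires
    return applicant["income"] > 50_000 and not abnormal(applicant)

print(approve({"income": 80_000, "delinquent": False}))  # True
print(approve({"income": 80_000, "delinquent": True}))   # False
```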
Forecasts help businesses allocate resources and achieve objectives. At LinkedIn, product owners use forecasts to set business targets, track outlook, and monitor health. Engineers use forecasts to efficiently provision hardware. Developing a forecasting solution to meet these needs requires accurate and interpretable forecasts on diverse time series, at sub-hourly to quarterly frequencies. We present Greykite, an open-source Python library for forecasting that has been deployed on over twenty use cases at LinkedIn. Its flagship algorithm, Silverkite, provides interpretable, fast, and highly flexible univariate forecasts that capture effects such as time-varying growth and seasonality, autocorrelation, holidays, and regressors. The library enables self-serve accuracy and trust by facilitating data exploration, model configuration, execution, and interpretation. Our benchmark results show good out-of-the-box speed and accuracy on datasets from a variety of domains. Over the past two years, Greykite forecasts have been trusted by finance, engineering, and product teams for resource planning and allocation, target setting and progress tracking, anomaly detection, and root cause analysis. We expect Greykite to be useful to forecast practitioners with similar applications who need accurate, interpretable forecasts that capture complex dynamics common to time series related to human activity.
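A minimal sketch in the style of Greykite's documented quickstart; the dataframe and its column names ("ts", "y"), the horizon, and the coverage level are assumptions for illustration:

```python
import pandas as pd
from greykite.framework.templates.autogen.forecast_config import (
    ForecastConfig, MetadataParam,
)
from greykite.framework.templates.forecaster import Forecaster

# toy daily series; real inputs would come from the application (assumption)
df = pd.DataFrame({"ts": pd.date_range("2020-01-01", periods=365), "y": range(365)})

result = Forecaster().run_forecast_config(
    df=df,
    config=ForecastConfig(
        model_template="SILVERKITE",   # the flagship algorithm
        forecast_horizon=30,           # predict 30 future points
        coverage=0.95,                 # 95% prediction intervals
        metadata_param=MetadataParam(time_col="ts", value_col="y"),
    ),
)
print(result.forecast.df.tail())  # forecasts with prediction intervals
```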